In [166]:
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import DataFrame
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.base import TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import FeatureUnion

In [167]:
data_dict = pickle.load(open("../ud120-projects/final_project/final_project_dataset.pkl", "r") )

Holdout

Since the dataset for this project is so small, a hold-out set will not be used, and only k-fold testing and training splits will be used to measure accuracy.

This is because even with a stratified hold-out set of 20%, with only 146 data points, a great deal of missing data, and 18 POIs, there would be only 3 or so POIs available for a final test. Performance metrics estimated on such a small hold-out set would have very poor precision, while setting the data aside would also hurt the ability to build the model.

"when the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. (...) Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. "

[1] Kuhn M., Johnson K. (2013). Applied Predictive Modeling. Springer. p. 67.

Hawkins et al. (2003) concisely summarize this point: “holdout samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.”

[2] Hawkins D., Basak S., Mills D. (2003). “Assessing Model Fit by Cross-Validation.” Journal of Chemical Information and Computer Sciences, 43(2), 579–586.

This will be addressed with K-fold cross-validation resampling techniques.

Version 2 - Cross Validation Scheme

  1. Define the sets of model parameter values to evaluate
  2. for each parameter set in grid search DO
    1. For each k-fold resampling iteration DO
      1. Hold out 1/K of the samples as the assessment fold
      2. Pre-process the data (fit each transformation on the training folds, then apply the same fitted transformation to the held-out fold)
        1. Impute data (median)
        2. Scale features: (x_i - mean)/std
        3. Perform any univariate feature selection (remove very low variance features)
        4. Model-based feature selection (ExtraTreesClassifier)
      3. Fit the model on the remaining (K-1)/K training folds
      4. Predict the held-out fold
    2. END
    3. Calculate the average performance across hold-out predictions
  3. END
  4. Determine the optimal parameter set
  5. Fit the final model to all training data using the optimal parameter set
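The scheme above maps onto the scikit-learn classes imported earlier. The following is only a minimal sketch: the step names, parameter values, and number of folds are illustrative placeholders, not the final choices made later in this notebook.

In [ ]:
# Sketch of the scheme above: every pre-processing step lives inside the Pipeline,
# so imputation/scaling/selection are re-fit on the training folds only.
cv_pipe = Pipeline(steps=[
    ('impute', Imputer(strategy='median')),           # impute (median)
    ('scale', StandardScaler()),                       # scale: (x_i - mean)/std
    ('low_var', VarianceThreshold(threshold=0.0)),     # univariate filter
    ('clf', ExtraTreesClassifier(n_estimators=100))    # model / model-based selection
])

param_grid = {'clf__min_samples_split': [2, 4, 10]}    # the parameter sets to evaluate (step 1)

# GridSearchCV handles the fold loop, the averaging across held-out folds, and the
# final refit on all training data with the optimal parameter set:
# grid = GridSearchCV(cv_pipe, param_grid, cv=StratifiedKFold(y, n_folds=5), scoring='f1')
# grid.fit(X, y)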

In [168]:
df = pd.DataFrame.from_dict(data_dict, orient='index')

'NaN' was imported as a string instead of a missing value. We will convert these to NaN type and look at how many missing values the data has.


In [169]:
# Replace 'NaN' strings with actual np.nan values
df = df.replace('NaN', np.nan)
# Replace email strings with True/False boolean as to whether an email was present or not
df['email_address'] = df['email_address'].fillna(0).apply(lambda x: x != 0, 1)

In [170]:
# Replace with index watcher
# A quick look at the original financial spreadsheet shows a TOTAL row at the bottom
# summing every column across all people. This is obviously an outlier with no
# meaningful information and can be removed.

# df[df['salary'] > 1000000]
# df[df.index == 'TOTAL']
df = df.drop('TOTAL', axis=0)

In [170]:

By default, GridSearchCV uses 3-fold cross-validation. However, if it detects that a classifier (rather than a regressor) is passed, it uses a stratified 3-fold.

http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
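For clarity, the CV object can also be passed in explicitly. A rough sketch with the older cross_validation API used in this notebook (clf and param_grid are placeholders here):

In [ ]:
# Sketch: explicit equivalent of the default behaviour for classifiers.
# skf = StratifiedKFold(df['poi'], n_folds=3)
# grid_search = GridSearchCV(clf, param_grid, cv=skf, scoring='f1')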

Remove columns with less than 50% of entries present.

Remove rows with no non-NA values


In [171]:
# low_var_remover = VarianceThreshold(threshold=.5)

In [172]:
# ************************
# Encode as 0 instead.
# Remove columns with more than 50% NA's
# df_50 = df.dropna(axis=1, thresh=len(df)/2)
# ************************

# Since email_address and poi are True/False, every record should have at least 2 non-NA.
# We'll next remove any rows that don't have at least 2 non-NA values besides these.
# The criterion is: no more than 11 NA's per row.
# df_50 = df_50.dropna(axis=0, thresh=5)

# 128 records remain.
# df_50.info()

Financial NA's

When looking at the source of the data, the NA entries in the financial data appear to be values reported as zero, since the payment/stock components add up to the total payments/stock values. These NA values should therefore be set to 0 so that the components still add up to the totals reported in the accounting spreadsheet.
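As a rough sanity check of that assumption, the payment components can be summed and compared against total_payments. This is only a sketch: the component list is typed out by hand, and any data-entry errors in the source spreadsheet will show up as nonzero differences.

In [ ]:
# Sketch: with NA's treated as 0, the payment components should add up to total_payments.
# pay_parts = ['salary', 'deferral_payments', 'bonus', 'expenses', 'loan_advances',
#              'other', 'director_fees', 'deferred_income', 'long_term_incentive']
# diff = df[pay_parts].fillna(0).sum(axis=1) - df['total_payments'].fillna(0)
# print diff.abs().max()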

Email statistics NA's

How to treat the NA's in the email statistics is a little more subjective.

  1. Some email statistics are features created with prior knowledge of the entire dataset (i.e. emails to/from POIs). This borders on data snooping: if new people were somehow added to the dataset, these features could not be generated for them without already knowing which of the new people are POIs.

  2. NA's here imply that the person either did not have an email account with Enron or was otherwise not part of the email corpus.

The email features are missing together: if one email column is NA for a person, they all are. It is hard to judge what distribution these values would have had if the person had been given an email account, since the email statistics have no ties to the financial data from which a distribution could be inferred.

We have no real way to infer whether a person sent/received 10 emails or 10,000 from completely unrelated financial data, drawn from a different dataset covering many different people.

For this reason, these NA's will also be encoded as 0.


In [173]:
df = df.apply(lambda x: x.fillna(0), axis=0)

In [173]:

Imputation


In [174]:
import seaborn as sns
sns.set(style='darkgrid')

f, ax = plt.subplots(figsize=(14, 14))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.corrplot(df.corr(), annot=True, sig_stars=False,
             diag_names=False, cmap=cmap, ax=ax)
f.tight_layout()



In [175]:
# Pick a column which we are predicting.
# Find other variables correlated with it and use KNeighborsRegressor to predict/impute
# the missing values.
# df_50.corr().ix[: ,'salary']

In [176]:
# cols1 = ['salary', 'other', 'total_stock_value', 'exercised_stock_options', 
#        'total_payments', 'restricted_stock']
# Bonus and salary values don't seem to be missing at random. Anytime there is a null value
# for salary, there is also one for bonus. So bonus can't be used to predict salary on
# the first pass. Predicted salary values will be used to predict bonus values though 
# on a second pass.
# cols2= ['salary', 'other', 'total_stock_value', 'exercised_stock_options', 
#        'total_payments', 'restricted_stock', 'bonus']
# cols3 = ['to_messages', 'from_this_person_to_poi', 'from_messages', 
# 'shared_receipt_with_poi', 'from_poi_to_this_person']

In [177]:
def kcluster_null(df=None, cols=None, process_all=True):
    '''
    Input: a pandas dataframe with values to impute, and a list of columns both to impute
        and to use as predictors.
    Returns: a pandas dataframe with null values imputed for the columns passed in.

    Ideally the columns should be somewhat correlated, since they are used in KNN to
    predict each other, one column at a time.
    
    '''
    
    # Create a KNN regression estimator for the imputation.
    income_imputer = KNeighborsRegressor(n_neighbors=1)
    # Loop through the columns passed in and impute each one sequentially.

    # NOTE: the process_all=False branch below is currently unused by the loop that follows.
    if not process_all:
        to_pred = cols[0]
        predictor_cols = cols[1:]
        
        
    for each in cols:
        # Create a temp list that does not include the column being predicted.
        temp_cols = [col for col in cols if col != each]
        # Create a dataframe that contains no missing values in the columns being predicted.
        # This will be used to train the KNN estimator.
        df_col = df[df[each].isnull()==False]
        
        # Create a dataframe with all of the nulls in the column being predicted.
        df_null_col = df[df[each].isnull()==True]
        
        # Create a temp dataframe filling in the medians for each column being used to
        # predict that is missing values.
        # This step is needed since we have so many missing values distributed through 
        # all of the columns.
        temp_df_medians = df_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        
        # Fit our KNN imputer to this dataframe now that we have values for every column.
        income_imputer.fit(temp_df_medians, df_col[each])
        
        # Fill the df (that has null values being predicted) with medians in the other
        # columns not being predicted.
        # ** This currently uses its own medians and should ideally use the predictor df's
        # ** median values to fill in NA's of columns being used to predict.
        temp_null_medians = df_null_col[temp_cols].apply(lambda x: x.fillna(x.median()), axis=0)
        
        # Predict the null values for the current 'each' variable.
        new_values = income_imputer.predict(temp_null_medians[temp_cols])

        # Replace the null values of the original null dataframe with the predicted values.
        df_null_col[each] = new_values
        
        # Append the newly predicted rows back to the dataframe which contained
        # no null values.
        # Overwrite the original df with this one containing predicted columns. 
        # Index order will not be preserved since it is rearranging each time by 
        # null values.
        df = df_col.append(df_null_col)
        
    # Return the final dataframe sorted by the index names.
    return df.sort_index(axis=0)
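A usage sketch for the helper above, using the correlated column groups from the commented-out cell earlier. It is illustrative only: the notebook ultimately encodes the missing values as 0 instead, so by this point there are no nulls left for the imputer to fill.

In [ ]:
# Illustrative only -- NA's were already encoded as 0 above, so this is not run here.
# cols1/cols2 are the column groups listed in the commented-out cell earlier.
# df_imputed = kcluster_null(df, cols=cols1)
# df_imputed = kcluster_null(df_imputed, cols=cols2)   # second pass, now including bonus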

In [177]:


In [178]:
df.irow(127)


Out[178]:
salary                            0
to_messages                       0
deferral_payments                 0
total_payments               362096
exercised_stock_options           0
bonus                             0
restricted_stock                  0
shared_receipt_with_poi           0
restricted_stock_deferred         0
total_stock_value                 0
expenses                          0
loan_advances                     0
from_messages                     0
other                        362096
from_this_person_to_poi           0
poi                           False
director_fees                     0
deferred_income                   0
long_term_incentive               0
email_address                 False
from_poi_to_this_person           0
Name: THE TRAVEL AGENCY IN THE PARK, dtype: object

In [179]:
#cols = [x for x in df.columns]
#for each in cols:
#    g = sns.FacetGrid(df, col='poi', margin_titles=True, size=6)
#    g.map(plt.hist, each, color='steelblue')

In [180]:
from pandas.tools.plotting import scatter_matrix

In [181]:
list(df.columns)


Out[181]:
['salary',
 'to_messages',
 'deferral_payments',
 'total_payments',
 'exercised_stock_options',
 'bonus',
 'restricted_stock',
 'shared_receipt_with_poi',
 'restricted_stock_deferred',
 'total_stock_value',
 'expenses',
 'loan_advances',
 'from_messages',
 'other',
 'from_this_person_to_poi',
 'poi',
 'director_fees',
 'deferred_income',
 'long_term_incentive',
 'email_address',
 'from_poi_to_this_person']

In [182]:
financial_cols = np.array(['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive'])

email_cols = np.array(['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address'])

In [183]:
from sklearn.ensemble import RandomForestClassifier

In [184]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[financial_cols], df['poi'])


Out[184]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [185]:
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

In [186]:
padding = np.arange(len(financial_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, financial_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [187]:
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[email_cols], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(email_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, email_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [188]:
all_cols = np.concatenate([email_cols, financial_cols])
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(df[all_cols], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(all_cols)) + 0.5
plt.figure(figsize=(14, 12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, all_cols[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()



In [189]:
df['ex_stock_bins'] = pd.cut(df.exercised_stock_options, bins=15, labels=False)
pd.value_counts(df.ex_stock_bins)


Out[189]:
0     118
1      11
2       6
3       4
8       2
14      1
13      1
6       1
4       1
dtype: int64

In [190]:
df.exercised_stock_options.plot()


Out[190]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ecc1ac8>

In [191]:
def capValues(x, cap):
    return (cap if x > cap else x)

In [192]:
df.exercised_stock_options = df.exercised_stock_options.apply(lambda x: capValues(x, 5000000))
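An equivalent pandas one-liner, as a sketch; for a numeric column this should behave the same as capValues.

In [ ]:
# Sketch: pandas can cap a numeric Series directly.
# df.exercised_stock_options = df.exercised_stock_options.clip(upper=5000000)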

In [193]:
df['ex_stock_bins'] = pd.cut(df.exercised_stock_options, bins=15, labels=False)
pd.value_counts(df.ex_stock_bins)


Out[193]:
0     60
1     18
14    16
2     13
4     12
6      7
3      5
5      4
12     3
7      3
13     2
9      2
dtype: int64

In [194]:
df[['ex_stock_bins', 'poi']].groupby('ex_stock_bins').mean().plot()


Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x179937f0>

In [195]:
df.columns


Out[195]:
Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments', u'exercised_stock_options', u'bonus', u'restricted_stock', u'shared_receipt_with_poi', u'restricted_stock_deferred', u'total_stock_value', u'expenses', u'loan_advances', u'from_messages', u'other', u'from_this_person_to_poi', u'poi', u'director_fees', u'deferred_income', u'long_term_incentive', u'email_address', u'from_poi_to_this_person', u'ex_stock_bins'], dtype='object')

In [196]:
df[['bonus', 'poi']].groupby('bonus').mean().plot()


Out[196]:
<matplotlib.axes._subplots.AxesSubplot at 0x1caeb780>

In [197]:
df.shared_receipt_with_poi.plot()


Out[197]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b1ce9e8>

In [198]:
max(df.shared_receipt_with_poi)


Out[198]:
5521.0

In [199]:
# Create bins for shared receipt with poi
my_bins = [min(df.shared_receipt_with_poi)] + [250] + range(500, 5000, 500) + [max(df.shared_receipt_with_poi)]
df['shared_poi_bins'] = pd.cut(df.shared_receipt_with_poi, bins=my_bins, labels=False, include_lowest=True)
pd.value_counts(df['shared_poi_bins'])


Out[199]:
0     81
2     19
5     11
3      9
1      9
4      6
8      4
6      4
10     2
dtype: int64

In [199]:


In [200]:
df[['shared_poi_bins', 'poi']].groupby('shared_poi_bins').mean().plot()


Out[200]:
<matplotlib.axes._subplots.AxesSubplot at 0x16ed2e10>

In [201]:
df.total_stock_value


Out[201]:
ALLEN PHILLIP K          1729541
BADUM JAMES P             257817
BANNANTINE JAMES M       5243487
BAXTER JOHN C           10623258
BAY FRANKLIN R             63014
BAZELIDES PHILIP J       1599641
BECK SALLY W              126027
BELDEN TIMOTHY N         1110705
BELFER ROBERT             -44093
BERBERIAN DAVID          2493616
BERGSIEKER RICHARD P      659249
BHATNAGAR SANJAY               0
BIBI PHILIPPE A          1843816
BLACHMAN JEREMY M         954354
BLAKE JR. NORMAN P             0
...
UMANOFF ADAM S                  0
URQUHART JOHN A                 0
WAKEHAM JOHN                    0
WALLS JR ROBERT H         5898997
WALTERS GARETH W          1030329
WASAFF GEORGE             2056427
WESTFAHL RICHARD K         384930
WHALEY DAVID A              98718
WHALLEY LAWRENCE G        6079137
WHITE JR THOMAS E        15144123
WINOKUR JR. HERBERT S           0
WODRASKA JOHN                   0
WROBEL BRUCE               139130
YEAGER F SCOTT           11884758
YEAP SOON                  192758
Name: total_stock_value, Length: 145, dtype: float64

In [201]:


In [202]:
from sklearn.preprocessing import StandardScaler

df['total_stock_scaled'] = StandardScaler().fit_transform(df[['total_stock_value']])
df['bonus_scaled'] = StandardScaler().fit_transform(df[['bonus']])

print df.total_stock_scaled.describe()
plt.hist(df.total_stock_scaled)


In [203]:
def dont_neg_log(x):
    if x >=0:
        return np.log1p(x)
    else:
        return 0
    
df['stock_log'] = df['total_stock_value'].apply(lambda x: dont_neg_log(x))

Feature Ratio Creation


In [204]:
financial_cols = np.array(['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive'])

email_cols = np.array(['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address'])

In [205]:
payment_comp = ['salary', 'deferral_payments','bonus', 'expenses', 'loan_advances',
                'other', 'director_fees', 'deferred_income', 'long_term_incentive']
payment_total = ['total_payments']

stock_comp = ['exercised_stock_options', 'restricted_stock','restricted_stock_deferred',]
stock_total = ['total_stock_value']

all_comp = payment_comp + stock_comp

email_comp = ['shared_receipt_with_poi', 'from_this_person_to_poi', 'from_poi_to_this_person' ]
email_totals = ['from_messages', 'to_messages'] # interaction_w_poi = total(from/to/shared poi)

In [205]:


In [206]:
df['total_compensation'] = df['total_payments'] + df['total_stock_value']

for each in payment_comp:
    df['{0}_{1}_ratio'.format(each, 'total_pay')] = df[each]/df['total_payments']

for each in stock_comp:
    df['{0}_{1}_ratio'.format(each, 'total_stock')] = df[each]/df['total_stock_value']

for each in all_comp:
    df['{0}_{1}_ratio'.format(each, 'total_compensation')] = df[each]/df['total_compensation']
    
    
df['total_poi_interaction'] = df['shared_receipt_with_poi'] + df['from_this_person_to_poi'] + \
df['from_poi_to_this_person']

for each in email_comp:
    df['{0}_{1}_ratio'.format(each, 'total_poi_int')] = df[each]/df['total_poi_interaction']

df['total_active_poi_interaction'] = df['from_this_person_to_poi'] + df['from_poi_to_this_person']
df['to_poi_total_active_poi_ratio'] = df['from_this_person_to_poi']/df['total_active_poi_interaction']
df['from_poi_total_active_poi_ratio'] = df['from_poi_to_this_person']/df['total_active_poi_interaction']

df['to_messages_to_poi_ratio'] = df['from_this_person_to_poi']/ df['to_messages']
df['from_messages_from_poi_ratio'] = df['from_poi_to_this_person']/df['from_messages']
df['shared_poi_from_messages_ratio'] = df['shared_receipt_with_poi']/df['from_messages']

A good portion of people were paid only in stock or only in payments, and another good portion have no email statistics at all.

The resulting divisions by zero produce NaN (for 0/0) and inf (for nonzero/0), so these ratio values need to be set to zero manually.


In [207]:
df = df.apply(lambda x: x.fillna(0), axis=0)
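Note that fillna only covers the 0/0 cases: a nonzero numerator over a zero denominator produces inf rather than NaN, and scikit-learn rejects infinite values just as it rejects NaN. A minimal sketch of clearing those as well:

In [ ]:
# Sketch: replace the +/-inf values produced by nonzero/0 divisions too.
# df = df.replace([np.inf, -np.inf], 0)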

In [209]:
df[['poi', 'from_messages', 'to_messages', 'shared_receipt_with_poi','total_active_poi_interaction']]


Out[209]:
poi from_messages to_messages shared_receipt_with_poi total_active_poi_interaction
ALLEN PHILLIP K False 2195 2902 1407 112
BADUM JAMES P False 0 0 0 0
BANNANTINE JAMES M False 29 566 465 39
BAXTER JOHN C False 0 0 0 0
BAY FRANKLIN R False 0 0 0 0
BAZELIDES PHILIP J False 0 0 0 0
BECK SALLY W False 4343 7315 2639 530
BELDEN TIMOTHY N True 484 7991 5521 336
BELFER ROBERT False 0 0 0 0
BERBERIAN DAVID False 0 0 0 0
BERGSIEKER RICHARD P False 59 383 233 4
BHATNAGAR SANJAY False 29 523 463 1
BIBI PHILIPPE A False 40 1607 1336 31
BLACHMAN JEREMY M False 14 2475 2326 27
BLAKE JR. NORMAN P False 0 0 0 0
BOWEN JR RAYMOND M True 27 1858 1593 155
BROWN MICHAEL False 41 1486 761 14
BUCHANAN HAROLD G False 125 1088 23 0
BUTTS ROBERT H False 0 0 0 0
BUY RICHARD B False 1053 3523 2333 227
CALGER CHRISTOPHER F True 144 2598 2188 224
CARTER REBECCA C False 15 312 196 36
CAUSEY RICHARD A True 49 1892 1585 70
CHAN RONNIE False 0 0 0 0
CHRISTODOULOU DIOMEDES False 0 0 0 0
CLINE KENNETH W False 0 0 0 0
COLWELL WESLEY True 40 1758 1132 251
CORDES WILLIAM R False 12 764 58 10
COX DAVID False 33 102 71 4
CUMBERLAND MICHAEL S False 0 0 0 0
... ... ... ... ... ...
SCRIMSHAW MATTHEW False 0 0 0 0
SHANKMAN JEFFREY A False 2681 3221 1730 177
SHAPIRO RICHARD S False 1215 15149 4527 139
SHARP VICTORIA T False 136 3136 2477 30
SHELBY REX True 39 225 91 27
SHERRICK JEFFREY B False 25 613 583 57
SHERRIFF JOHN R False 92 3187 2103 51
SKILLING JEFFREY K True 108 3627 2042 118
STABLER FRANK False 0 0 0 0
SULLIVAN-SHAKLOVITZ COLLEEN False 0 0 0 0
SUNDE MARTIN False 38 2647 2565 50
TAYLOR MITCHELL S False 29 533 300 0
THE TRAVEL AGENCY IN THE PARK False 0 0 0 0
THORN TERENCE H False 41 266 73 0
TILNEY ELIZABETH A False 19 460 379 21
UMANOFF ADAM S False 18 111 41 12
URQUHART JOHN A False 0 0 0 0
WAKEHAM JOHN False 0 0 0 0
WALLS JR ROBERT H False 146 671 215 17
WALTERS GARETH W False 0 0 0 0
WASAFF GEORGE False 30 400 337 29
WESTFAHL RICHARD K False 0 0 0 0
WHALEY DAVID A False 0 0 0 0
WHALLEY LAWRENCE G False 556 6019 3920 210
WHITE JR THOMAS E False 0 0 0 0
WINOKUR JR. HERBERT S False 0 0 0 0
WODRASKA JOHN False 0 0 0 0
WROBEL BRUCE False 0 0 0 0
YEAGER F SCOTT True 0 0 0 0
YEAP SOON False 0 0 0 0

145 rows × 5 columns


In [210]:
df[df['poi']==True]


Out[210]:
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... total_poi_interaction shared_receipt_with_poi_total_poi_int_ratio from_this_person_to_poi_total_poi_int_ratio from_poi_to_this_person_total_poi_int_ratio total_active_poi_interaction to_poi_total_active_poi_ratio from_poi_total_active_poi_ratio to_messages_to_poi_ratio from_messages_from_poi_ratio shared_poi_from_messages_ratio
BELDEN TIMOTHY N 213999 7991 2144013 5501630 953136 5249999 157569 5521 0 1110705 ... 5857 0.942633 0.018439 0.038928 336 0.321429 0.678571 0.013515 0.471074 11.407025
BOWEN JR RAYMOND M 278601 1858 0 2669589 0 1350000 252055 1593 0 252055 ... 1748 0.911327 0.008581 0.080092 155 0.096774 0.903226 0.008073 5.185185 59.000000
CALGER CHRISTOPHER F 240189 2598 0 1639297 0 1250000 126027 2188 0 126027 ... 2412 0.907131 0.010365 0.082504 224 0.111607 0.888393 0.009623 1.381944 15.194444
CAUSEY RICHARD A 415189 1892 0 1868758 0 1000000 2502063 1585 0 2502063 ... 1655 0.957704 0.007251 0.035045 70 0.171429 0.828571 0.006342 1.183673 32.346939
COLWELL WESLEY 288542 1758 27610 1490344 0 1200000 698242 1132 0 698242 ... 1383 0.818510 0.007954 0.173536 251 0.043825 0.956175 0.006257 6.000000 28.300000
DELAINEY DAVID W 365163 3093 0 4747979 2291113 3000000 1323148 2097 0 3614261 ... 2772 0.756494 0.219697 0.023810 675 0.902222 0.097778 0.196896 0.021505 0.683284
FASTOW ANDREW S 440698 0 0 2424083 0 1300000 1794412 0 0 1794412 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
GLISAN JR BEN F 274975 873 0 1272284 384728 600000 393818 874 0 778546 ... 932 0.937768 0.006438 0.055794 58 0.103448 0.896552 0.006873 3.250000 54.625000
HANNON KEVIN P 243293 1045 0 288682 5000000 1500000 853064 1035 0 6391065 ... 1088 0.951287 0.019301 0.029412 53 0.396226 0.603774 0.020096 1.000000 32.343750
HIRKO JOSEPH 0 0 10259 91093 5000000 0 0 0 0 30766064 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
KOENIG MARK E 309946 2374 0 1587421 671737 700000 1248318 2271 0 1920055 ... 2339 0.970928 0.006413 0.022659 68 0.220588 0.779412 0.006318 0.868852 37.229508
KOPPER MICHAEL J 224305 0 0 2652612 0 800000 985032 0 0 985032 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000
LAY KENNETH L 1072321 4273 202911 103559793 5000000 7000000 14761694 2411 0 49110078 ... 2550 0.945490 0.006275 0.048235 139 0.115108 0.884892 0.003744 3.416667 66.972222
RICE KENNETH D 420636 905 0 505050 5000000 1750000 2748364 864 0 22542539 ... 910 0.949451 0.004396 0.046154 46 0.086957 0.913043 0.004420 2.333333 48.000000
RIEKER PAULA H 249201 1328 214678 1099100 1635238 700000 283649 1258 0 1918887 ... 1341 0.938106 0.035794 0.026100 83 0.578313 0.421687 0.036145 0.426829 15.341463
SHELBY REX 211844 225 0 2003885 1624396 200000 869220 91 0 2493616 ... 118 0.771186 0.118644 0.110169 27 0.518519 0.481481 0.062222 0.333333 2.333333
SKILLING JEFFREY K 1111258 3627 0 8682716 5000000 5600000 6843672 2042 0 26093672 ... 2160 0.945370 0.013889 0.040741 118 0.254237 0.745763 0.008271 0.814815 18.907407
YEAGER F SCOTT 158403 0 0 360300 5000000 0 3576206 0 0 11884758 ... 0 0.000000 0.000000 0.000000 0 0.000000 0.000000 0.000000 0.000000 0.000000

18 rows × 61 columns

director_fees_total_pay_ratio, deferred_income_total_pay_ratio, exercised_stock_options_total_stock_ratio, exercised_stock_options_total_stock_ratio, restricted_stock_deferred_total_stock_ratio, restricted_stock_total_stock_ratio, director_fees_total_compensation_ratio, deferred_income_total_compensation_ratio, restricted_stock_total_compensation_ratio, restricted_stock_deferred_total_compensation_ratio


In [238]:
# Column slicing by number
df.ix[:,5:10].describe()


Out[238]:
bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value
count 145.000000 145.000000 145.000000 145.000000 145.000000
mean 671335.303448 862546.386207 697.765517 72911.572414 2889718.124138
std 1230147.632511 2010852.212383 1075.128126 1297469.064327 6172223.035654
min 0.000000 -2604490.000000 0.000000 -1787380.000000 -44093.000000
25% 0.000000 0.000000 0.000000 0.000000 221141.000000
50% 300000.000000 360528.000000 114.000000 0.000000 955873.000000
75% 800000.000000 698920.000000 900.000000 0.000000 2282768.000000
max 8000000.000000 14761694.000000 5521.000000 15456290.000000 49110078.000000

In [212]:
#all_cols2 = np.concatenate([all_cols, np.array(['shared_poi_bins', 'ex_stock_bins', 
#                                                'total_stock_scaled', 'bonus_scaled',
#                                                'stock_log'])])

features = np.array(df.drop('poi', axis=1).columns)

clf = ExtraTreesClassifier(n_estimators=2000)
clf.fit(df[features], df['poi'])

importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

padding = np.arange(len(features)) + 0.5
plt.figure(figsize=(14,12))
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-212-297fbf6db390> in <module>()
      6 
      7 clf = ExtraTreesClassifier(n_estimators=2000)
----> 8 clf.fit(df[features], df['poi'])
      9 
     10 importances = clf.feature_importances_

c:\Anaconda\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    222 
    223         # Convert data
--> 224         X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
    225 
    226         # Remap output

c:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in check_arrays(*arrays, **options)
    281                     array = np.asarray(array, dtype=dtype)
    282                 if not allow_nans:
--> 283                     _assert_all_finite(array)
    284 
    285             if not allow_nd and array.ndim >= 3:

c:\Anaconda\lib\site-packages\sklearn\utils\validation.pyc in _assert_all_finite(X)
     41             and not np.isfinite(X).all()):
     42         raise ValueError("Input contains NaN, infinity"
---> 43                          " or a value too large for %r." % X.dtype)
     44 
     45 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [91]:
confusion_matrix(df['poi'], clf.predict(df))


Out[91]:
array([[127,   0],
       [  0,  18]])

In [43]:
# NOTE: all_cols2 is only defined in the commented-out cell above (from an earlier run).
X_df = df[all_cols2]
y_df = df['poi']

Train


In [44]:
FINANCIAL_FIELDS = ['salary', 'deferral_payments', 'total_payments', 'exercised_stock_options', 
                  'bonus', 'restricted_stock', 'restricted_stock_deferred', 'total_stock_value',
                  'expenses', 'loan_advances', 'other', 'director_fees', 'deferred_income', 
                  'long_term_incentive', 'ex_stock_bins', 'stock_log']

EMAIL_FIELDS = ['from_messages', 'to_messages', 'shared_receipt_with_poi', 
              'from_this_person_to_poi', 'from_poi_to_this_person', 'email_address',
              'shared_poi_bins']

In [56]:
class ColumnExtractor(TransformerMixin):
    '''
    Column extractor transformer for sklearn pipelines.
    Inherits fit_transform() from TransformerMixin, but this is explicitly
    defined here for clarity.
    
    Methods to extract pandas dataframe columns are defined for this class.
    
    '''
    def __init__(self, columns=[]):
        self.columns = columns
    
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    
    def transform(self, X, **transform_params):
        '''
        Input: A pandas dataframe and a list of column names to extract.
        Output: A pandas dataframe containing only the columns of the names passed in.
        '''
        return X[self.columns]
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def get_params(self, deep=True):
        """Get parameters for this estimator.
        Parameters
        ----------
        deep: boolean, optional
            If True, will return the parameters for this estimator and
            contained subobjects that are estimators.
        Returns
        -------
        params : mapping of string to any
            Parameter names mapped to their values.
        """

        return {'columns': self.columns}
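A quick usage sketch of the transformer (FINANCIAL_FIELDS as defined above; the printed shape is just for inspection):

In [ ]:
# Sketch: ColumnExtractor just slices the dataframe down to the named columns, so it
# can sit at the front of each branch of a FeatureUnion (as in the pipeline further down).
# extractor = ColumnExtractor(columns=FINANCIAL_FIELDS)
# print extractor.fit_transform(df).shape   # -> (n_rows, len(FINANCIAL_FIELDS))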

In [87]:
X_df = df[['total_payments', 'total_stock_value', 'shared_receipt_with_poi', 'bonus']]
y_df = df['poi']

from sklearn.svm import LinearSVC
sk_fold = StratifiedShuffleSplit(y_df, n_iter=10, test_size=0.2) 
        
pipeline = Pipeline(steps=[#('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           ('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           ('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           #('feature_selection', LinearSVC()),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features='auto', min_samples_split=2,
                                                       min_samples_leaf=1))])
    
params = {'ET__n_estimators': [1500],
          'ET__max_features': ['auto', None],
          'ET__min_samples_split': [2, 4, 10],
          'ET__min_samples_leaf': [1, 2, 5],
          #'feature_selection__C': [0.1, 1, 10],
          #'ET__criterion' : ['gini', 'entropy'],
          #'imputer__strategy': ['median', 'mean'],
          # NOTE: to actually vary the VarianceThreshold cutoff this key should be
          # 'low_var_remover__threshold'; as written it does not reach the step's threshold.
          'low_var_remover': [0, 0.1, .25, .50, .75]
          }
    
grid_search = GridSearchCV(pipeline, param_grid=params, cv=sk_fold, n_jobs = -1, scoring='f1')
grid_search.fit(X_df, y=y_df)
#test_pred = grid_search.predict(X_test)
#print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
print "Best Estimator: ", grid_search.best_estimator_
    #f1_avg.append(f1_score(y_test, test_pred))
#print "F1: ", f1_score(y_test, test_pred)
#print "Confusion Matrix: "
#print confusion_matrix(y_test, test_pred)
#print "Accuracy Score: ", accuracy_score(y_test, test_pred)
print "Best Params: ", grid_search.best_params_


Best Estimator:  Pipeline(steps=[('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)), ('low_var_remover', VarianceThreshold(threshold=0.1)), ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
           criterion='gini', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
           min_samples_split=4, n_estimators=1500, n_jobs=1,
           oob_score=True, random_state=None, verbose=0))])
Best Params:  {'ET__n_estimators': 1500, 'ET__min_samples_split': 4, 'low_var_remover': 0.25, 'ET__max_features': None, 'ET__min_samples_leaf': 1}

In [88]:
n_iter = 100
sk_fold = StratifiedShuffleSplit(y_df, n_iter=n_iter, test_size=0.1)
f1_avg = []
for train_index, test_index in sk_fold:
    X_train, X_test = X_df.irow(train_index), X_df.irow(test_index)
    y_train, y_test = y_df[train_index], y_df[test_index]

    grid_search.best_estimator_.fit(X_train, y=y_train)
    # pipeline.fit(X_train, y=y_train)
    test_pred = grid_search.predict(X_test)
    #test_pred = pipeline.predict(X_test)

    #print "Cross_Val_score: ", cross_val_score(grid_search, X_train, y_train)
    #print "Best Estimator: ", grid_search.best_estimator_
    #print f1_score(y_test, test_pred)
    f1_avg.append(f1_score(y_test, test_pred))
print sum(f1_avg)/n_iter


0.236333333333

In [ ]:
pipeline = Pipeline(steps=[#('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
                           #('standardizer', StandardScaler(copy=True, with_mean=True, with_std=True)),
                           #('low_var_remover', VarianceThreshold(threshold=0.1)), 
                           #('feature_selection', LinearSVC()),
                           ('features', FeatureUnion([
                                ('financial', Pipeline([
                                    ('extract', ColumnExtractor(FINANCIAL_FIELDS)),
                                    ('scale', StandardScaler()),
                                    ('reduce', LinearSVC())
                                ])),

                                ('email', Pipeline([
                                    ('extract2', ColumnExtractor(EMAIL_FIELDS)),
                                    ('scale2', StandardScaler()),
                                    ('reduce2', LinearSVC())
                                ]))

                            ])),
                           ('ET', ExtraTreesClassifier(bootstrap=True, compute_importances=None,
                                                       criterion='gini', n_estimators=1500, n_jobs=1,
                                                       oob_score=True, random_state=None, verbose=0,
                                                       max_features=None, min_samples_split=2,
                                                       min_samples_leaf=1))
                            ])

In [ ]:


In [ ]: